Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, query, and analysis. Initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix.〔(Use Case Study of Hive/Hadoop )〕 Amazon maintains a software fork of Apache Hive that is included in Amazon Elastic MapReduce on Amazon Web Services.〔(Amazon Elastic MapReduce Developer Guide )〕

==Features==
Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and in compatible file systems such as the Amazon S3 filesystem. It provides an SQL-like language called HiveQL〔(HiveQL Language Manual )〕 with schema on read, and transparently converts queries into MapReduce, Apache Tez,〔(Apache Tez )〕 or Spark jobs; all three execution engines can run on Hadoop YARN. To accelerate queries, Hive provides indexes, including bitmap indexes.〔(Working with Students to Improve Indexing in Apache Hive )〕

By default, Hive stores metadata in an embedded Apache Derby database; other client/server databases such as MySQL can optionally be used. Hive currently supports four file formats: TEXTFILE,〔(Optimising Hadoop and Big Data with Text and Hive )〕 SEQUENCEFILE, ORC,〔(LanguageManual ORC )〕 and RCFILE.〔(Faster Big Data on Hadoop with Hive and RCFile )〕〔(Facebook's Petabyte Scale Data Warehouse using Hive and Hadoop )〕〔(RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems )〕 Apache Parquet can be read via a plugin in versions later than 0.10 and natively starting with 0.13.

Other features of Hive include:
* Indexing for query acceleration; the index types available as of 0.10 are compaction and bitmap indexes, with more index types planned.
* Support for different storage types, such as plain text, RCFile, HBase, ORC, and others.
* Metadata storage in an RDBMS, significantly reducing the time needed to perform semantic checks during query execution.
* Operation on compressed data stored in the Hadoop ecosystem, using algorithms including DEFLATE, BWT, Snappy, and others.
* Built-in user-defined functions (UDFs) for manipulating dates, strings, and other data-mining primitives; Hive supports extending the UDF set to handle use cases not covered by the built-in functions.
* SQL-like queries (HiveQL), which are implicitly converted into MapReduce, Tez, or Spark jobs.

Excerpt source: Wikipedia, the free encyclopedia ("Apache Hive").
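The "schema on read" model mentioned above means Hive keeps table files in their raw form and applies a column schema only when the data is scanned, rather than validating rows at load time. A minimal sketch of the idea in Python (illustrative only; the names and the comma-delimited format are hypothetical, not Hive internals):

```python
# Illustrative sketch of schema-on-read: raw text rows are stored untouched,
# and a separately kept schema (in Hive, metastore metadata) is applied
# only when a row is read. All names here are hypothetical.
raw_rows = [
    "1,alice,2021-03-01",
    "2,bob,2021-03-02",
]

# The schema lives outside the data files: column name plus a cast function.
schema = [("id", int), ("name", str), ("signup", str)]

def read_with_schema(line, schema):
    """Parse one delimited text row into typed columns at read time."""
    fields = line.split(",")
    return {name: cast(value) for (name, cast), value in zip(schema, fields)}

rows = [read_with_schema(line, schema) for line in raw_rows]
print(rows[0])  # the raw file was never rewritten to match the schema
```

Because the cast happens at read time, changing the schema never requires rewriting the underlying files; malformed values surface as errors only when queried.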
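The bitmap indexes Hive offers suit low-cardinality columns: each distinct value gets a bit-vector in which bit i is set when row i holds that value. A toy sketch of the data structure (illustrative; this is not Hive's on-disk index format):

```python
# Minimal sketch of a bitmap index (illustrative; not Hive's implementation).
# One Python int serves as the bit-vector for each distinct column value.
rows = ["US", "DE", "US", "FR", "DE", "US"]  # a low-cardinality column

bitmaps = {}
for i, value in enumerate(rows):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << i)

def rows_matching(value):
    """Decode a bit-vector back into the matching row numbers."""
    bits = bitmaps.get(value, 0)
    return [i for i in range(len(rows)) if bits >> i & 1]

print(rows_matching("US"))  # [0, 2, 5]
```

The payoff is that predicates combine as cheap bitwise operations: `bitmaps["US"] | bitmaps["DE"]` answers an OR over two values without rescanning the rows.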
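Of the compression algorithms listed, DEFLATE is the easiest to demonstrate directly, since it backs Python's standard `zlib` module. A hedged sketch (Hive itself delegates compression to Hadoop codecs; this only shows the algorithm's effect on repetitive row data):

```python
import zlib

# Hedged sketch: Hive delegates compression to Hadoop codecs; this only
# demonstrates DEFLATE itself, via Python's zlib, on repetitive row data.
record = b"2021-03-01,US,click\n" * 1000  # warehouse rows repeat heavily

compressed = zlib.compress(record, 6)     # DEFLATE at a mid compression level
restored = zlib.decompress(compressed)

assert restored == record                 # lossless round-trip
print(len(record), "->", len(compressed))
```

Repetitive, column-aligned warehouse data is exactly where such codecs shine, which is why operating directly on compressed files is worthwhile.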
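Beyond Java UDFs, one common route for extending Hive is a streaming script invoked from HiveQL's TRANSFORM clause: Hive pipes rows to the script as tab-separated lines on stdin and reads transformed rows from stdout. A sketch of such a script's core logic, with a sample list standing in for the stream (the column names are hypothetical):

```python
# Hedged sketch of a Hive TRANSFORM-style streaming transformation:
# rows arrive as tab-separated lines; here a sample list stands in for
# stdin. Column names (user_id, name) are hypothetical.
sample_stdin = ["1\t Alice \n", "2\tBOB\n"]

def normalize(name):
    """Example per-row transformation: trim and lowercase a string column."""
    return name.strip().lower()

out = []
for line in sample_stdin:
    user_id, name = line.rstrip("\n").split("\t")
    out.append(f"{user_id}\t{normalize(name)}")

print("\n".join(out))
```

In a real deployment the loop would read `sys.stdin` and `print` each row, and the script would be registered with `ADD FILE` before being referenced from a TRANSFORM query.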